Analysis and Enhancement of Conditional Random Fields Gene Mention Taggers in BioCreative II Challenge Evaluation
Abstract
Background: Tagging gene and gene product mentions in scientific text is an important initial step of literature mining. In the BioCreative 2 challenge, the conditional random fields (CRF) model was the most prevalent method in the gene mention task. In this paper, we analyze the two best-performing CRF-based systems in BioCreative 2. We examine their key claims and propose enhancements based on the analysis results. Results: We implemented their systems in MALLET, as specified in their reports, and in CRF++, a different CRF package, to analyze their claims empirically. We found that their feature set is effective for models trained with MALLET, but that a smaller set works better for models trained with CRF++. We confirmed the effectiveness of pairing parentheses as a post-processing step. We found that backward parsing is not always superior to forward parsing; the benefit of applying bidirectional parsing is that it creates a wider variety of complementary models. We elaborated the notion of divergent models by relating it to the difference between the increments of true positives and false positives of the union model. Conclusions: To further enhance performance, we can integrate more models, using the elaborated notion of divergent models that we derived to minimize the number of models required.
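The two post-processing ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact rules, so the sketch assumes that "pairing parentheses" means discarding predicted mentions whose parentheses are unbalanced, and that the "union model" is the set union of the mentions predicted by two models. The function names and the example mentions are hypothetical.

```python
def has_paired_parentheses(mention: str) -> bool:
    """Return True if every '(' in the mention is closed by a later ')'."""
    depth = 0
    for ch in mention:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its matching '('
                return False
    return depth == 0


def filter_unpaired(mentions):
    """Assumed post-processing rule: drop mentions with unbalanced parentheses."""
    return [m for m in mentions if has_paired_parentheses(m)]


def union_increments(base, other, gold):
    """Count the true and false positives gained by adding `other` to `base`.

    All arguments are collections of predicted (or gold) mentions.
    Returns (true_positive_gain, false_positive_gain) for the union model.
    """
    base, other, gold = set(base), set(other), set(gold)
    added = other - base  # mentions only the second model contributes
    tp_gain = len(added & gold)
    fp_gain = len(added - gold)
    return tp_gain, fp_gain


# Hypothetical example: forward and backward models with partly different output.
forward = {"A", "B"}
backward = {"B", "C", "D"}
gold = {"A", "C"}
tp_gain, fp_gain = union_increments(forward, backward, gold)
```

Under these assumptions, a second model is worth adding to the union when its true-positive gain exceeds its false-positive gain, which is one way to read the abstract's notion of divergent models.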
Similar resources
HTSZ_CEM System for Chemical Entity Mention Recognition in Patents
In this paper, a machine learning-based system was proposed for the challenge task of chemical entity mention recognition in patents (CEMP) in BioCreative V. The CEMP task was recognized as a sequence labeling problem and conditional random fields (CRF) were employed for it. Evaluation on the CEMP challenge corpus showed that our system (team 293) achieved a micro F-measure of 87.03%.
BCC-NER: bidirectional, contextual clues named entity tagger for gene/protein mention recognition
Tagging biomedical entities such as gene, protein, cell, and cell-line is the first step and an important pre-requisite in biomedical literature mining. In this paper, we describe our hybrid named entity tagging approach namely BCC-NER (bidirectional, contextual clues named entity tagger for gene/protein mention recognition). BCC-NER is deployed with three modules. The first module is for text ...
Combining Machine Learning with Dictionary Lookup for Chemical Compound and Drug Name Recognition Task
Following the interest in Named Entity Recognition shown in the academic literature by the Gene Mention recognition task of BioCreative I and II, BioCreative IV aims to bring such systems to the detection of mentions of chemical compounds and drugs. Considering that the machine learning methods have obtained great success in the correct identification of gene and prot...
Recognizing Biomedical Named Entities Using Skip-Chain Conditional Random Fields
Linear-chain Conditional Random Fields (CRF) have been applied to the Named Entity Recognition (NER) task in many biomedical text mining and information extraction systems. However, the linear-chain CRF cannot capture long-distance dependencies, which are very common in the biomedical literature. In this paper, we propose a novel study of capturing such long-distance dependencies by defining ...
Towards Gene Recognition from Rare and Ambiguous Abbreviations using a Filtering Approach
Retrieving information about highly ambiguous gene/protein homonyms is a challenge, in particular where their non-protein meanings are more frequent than their protein meaning (e.g., SAH or HF). Due to their limited coverage in common benchmarking data sets, the performance of existing gene/protein recognition tools on these problematic cases is hard to assess. We uniformly sample a corpus of ...